fix(train): support eval-only mode (--num-rollout 0) by EazyReal · Pull Request #2109 · THUDM/slime

EazyReal · 2026-06-20T18:06:27Z

What changed

get_optimizer_param_scheduler now computes the estimated training-iteration count once and clamps the scheduler-visible train_iters to at least 1 before deriving Megatron LR/WD schedule steps.

A CPU regression test (tests/test_eval_only_optimizer_scheduler.py) stubs Megatron's OptimizerParamScheduler, preserves its lr_decay_steps > 0 assertion, and is registered in the cpu-unittest matrix.

Why

train.py has an eval-only path for --num-rollout 0 with --eval-interval, but model and optimizer setup run before that branch. With num_rollout == 0, the old estimate produced train_iters == 0, then lr_decay_steps == 0, so Megatron aborted before eval could start.

The clamp only gives the scheduler a valid nonzero size for zero-estimated runs. It does not add training iterations; the training loop is still controlled by args.num_rollout. For normal configs that already estimate at least one optimizer step, the value is unchanged.
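The clamp can be sketched as follows. This is a minimal illustration, not the code in the PR: the helper names (`estimate_train_iters`, `schedule_steps`) are hypothetical, and the estimate formula and the `rollout_batch_size`, `n_samples_per_prompt`, `global_batch_size`, and `lr_decay_iters` fields are assumptions about how the iteration count is derived.

```python
from types import SimpleNamespace


def estimate_train_iters(args):
    # Assumed estimate: total samples across all rollouts divided by the
    # global batch size. With num_rollout == 0 this is 0.
    total_samples = args.num_rollout * args.rollout_batch_size * args.n_samples_per_prompt
    return total_samples // args.global_batch_size


def schedule_steps(args):
    # Compute the estimate once, then clamp only the scheduler-visible value.
    # args.num_rollout is never modified, so the training loop length is unchanged.
    train_iters = max(estimate_train_iters(args), 1)
    lr_decay_iters = train_iters if args.lr_decay_iters is None else args.lr_decay_iters
    lr_decay_steps = lr_decay_iters * args.global_batch_size
    wd_incr_steps = train_iters * args.global_batch_size
    return train_iters, lr_decay_steps, wd_incr_steps


eval_only = SimpleNamespace(
    num_rollout=0, rollout_batch_size=8, n_samples_per_prompt=4,
    global_batch_size=8, lr_decay_iters=None,
)
normal = SimpleNamespace(
    num_rollout=4, rollout_batch_size=8, n_samples_per_prompt=4,
    global_batch_size=8, lr_decay_iters=None,
)

eval_only_result = schedule_steps(eval_only)  # clamped: nonzero schedule steps
normal_result = schedule_steps(normal)        # unchanged: estimate is already >= 1
```

Under these assumed values the eval-only config yields one scheduler iteration and nonzero decay steps, while the normal config keeps its estimated 16 iterations.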

Validation

tests/test_eval_only_optimizer_scheduler.py covers two cases: the eval-only startup case, where num_rollout=0 no longer trips Megatron's scheduler assertion and train_iters is set to 1, and a normal training config, where train_iters remains 16.
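The shape of such a test might look like the sketch below. It is not the test file from this PR: `StubOptimizerParamScheduler`, `build_scheduler`, and `make_args` are hypothetical stand-ins, the `clamp` flag exists only to contrast old and new behavior, and the argument values are assumptions chosen so that the normal case estimates 16 iterations.

```python
from types import SimpleNamespace


class StubOptimizerParamScheduler:
    """Stand-in for Megatron's scheduler that keeps only the relevant assertion."""

    def __init__(self, lr_decay_steps, wd_incr_steps):
        assert lr_decay_steps > 0  # same condition the real scheduler enforces
        self.lr_decay_steps = lr_decay_steps
        self.wd_incr_steps = wd_incr_steps


def build_scheduler(args, clamp=True):
    # Assumed estimate formula; clamp=False reproduces the pre-fix behavior.
    estimated = (
        args.num_rollout * args.rollout_batch_size * args.n_samples_per_prompt
        // args.global_batch_size
    )
    args.train_iters = max(estimated, 1) if clamp else estimated
    steps = args.train_iters * args.global_batch_size
    return StubOptimizerParamScheduler(lr_decay_steps=steps, wd_incr_steps=steps)


def make_args(num_rollout):
    return SimpleNamespace(
        num_rollout=num_rollout, rollout_batch_size=8,
        n_samples_per_prompt=4, global_batch_size=8,
    )


def test_eval_only_startup():
    args = make_args(0)
    build_scheduler(args)  # must not raise
    assert args.train_iters == 1
    assert args.num_rollout == 0  # training loop length is untouched


def test_normal_config_unchanged():
    args = make_args(4)
    build_scheduler(args)
    assert args.train_iters == 16


def test_unclamped_zero_is_rejected():
    # Without the clamp, the stub's assertion fires as Megatron's would.
    try:
        build_scheduler(make_args(0), clamp=False)
    except AssertionError:
        return
    raise RuntimeError("expected the scheduler assertion to fire")
```

Because the stub has no Megatron or GPU dependency, tests of this shape can run under pytest on a CPU-only runner.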

Fixes #1785

EazyReal · 2026-06-25T08:47:52Z

@zhuzilin could you review this one? Eval-only mode with --num-rollout 0 still constructs the Megatron optimizer scheduler, which rejects zero lr_decay_steps; this keeps the training loop at zero rollouts while giving the scheduler the smallest valid shape.

EazyReal force-pushed the fix/eval-only-num-rollout-zero branch from a658aac to 9e5f530 Compare June 20, 2026 18:21

EazyReal changed the title ~~fix: support eval-only mode (--num-rollout 0)~~ fix(train): support eval-only mode (--num-rollout 0) Jun 24, 2026

EazyReal force-pushed the fix/eval-only-num-rollout-zero branch from 9e5f530 to 1f59044 Compare June 24, 2026 03:18

fix(train): support eval-only mode (--num-rollout 0)

6f3d1d3

EazyReal force-pushed the fix/eval-only-num-rollout-zero branch from 1f59044 to 6f3d1d3 Compare June 24, 2026 04:19

EazyReal wants to merge 1 commit into THUDM:main from EazyReal:fix/eval-only-num-rollout-zero
